Abstract

Erasure codes are an integral part of many distributed
storage systems aimed at Big Data, since they provide
high fault-tolerance for low overheads. However, traditional
erasure codes are inefficient at reading stored data
in degraded environments (when nodes might be unavailable)
and at replenishing lost data (vital for long-term
resilience). Consequently, novel codes optimized to cope
with the nuances of distributed storage systems are being
vigorously researched. In this paper, we take an engineering
alternative, exploring the use of simple and mature
techniques: juxtaposing a standard erasure code with
RAID-4-like parity. We carry out an analytical study to
determine the efficacy of this approach over traditional as
well as some novel codes. We build upon this study to design
CORE, a general storage primitive that we integrate
into HDFS. We benchmark this implementation in a proprietary
cluster and in EC2. Our experiments show that,
compared to traditional erasure codes, CORE uses 50%
less bandwidth and is up to 75% faster when recovering
a single failed node, while the gains are respectively 15%
and 60% for double node failures.

1 Introduction 

In order to meet the conflicting needs of high fault-tolerance
and low storage overhead, erasure codes are increasingly
being embraced for distributed storage systems^1 aimed at
storing high volumes of data. Traditional erasure codes have
mostly been designed to optimize the performance of
communication-centric applications, and are not necessarily
amenable to the needs of storage systems. Desirable properties
for storage include efficient replenishment of lost redundancy
(repair) following the failure of some system components, and
efficient access to data while the system is yet to complete
remedial actions following such failures (degraded
reads/access). To that end, there has been tremendous interest
in both the coding theory and the storage systems research
communities in building new erasure codes with good
repairability properties, as well as in building robust storage
systems that leverage these novel codes (for instance, Windows
Azure Storage using Local Reconstruction Codes). In this paper
we explore an alternate design, looking at an instance of
product codes [25]. A traditional erasure code is first applied
to individual data objects, followed by the creation of
RAID-4-like parity over erasure-encoded pieces of different
objects, creating cross-object redundancy. This results in high
fault tolerance (provided by the traditional code) and cheap
repairs (provided by the parity code). The approach is simple
and based on mature techniques that have long been used as
stand-alone approaches (which is desirable for practical and
implementation considerations), yet it achieves very good
repairability (less communication and computation) and degraded
data access under many fault conditions. We accordingly build
the CORE storage primitive as a general-purpose, block-level,
fault-tolerant data storage layer that can be readily integrated
into distributed file systems relying on an underlying
block-level storage, providing a significant performance boost.
We integrate CORE into the Hadoop Distributed File System
(HDFS), and benchmark it over a wide range of system
configurations, comparing it with state-of-the-art alternatives
to demonstrate its efficacy.

^1 Scaling-out (horizontal scaling) has become a norm to deal with
very large scale storage, and this work is thus focused on distributed
storage system architectures. We do not elaborate further on the usual
motivations and compulsions for such a system design choice.

CORE builds upon our recent work [1], where we made
a simple observation: by introducing a RAID-4-like parity
over a small set of erasure-encoded pieces, it is possible
to achieve a significant reduction in the expected cost
of repairing lost redundancy, and by doing so across
different independent objects, the overall fault-tolerance of
the resulting system of data objects differs only marginally
from what is achieved with optimal (maximum distance
separable) codes. This naturally translates also
into better degraded reads. Bandwidth gain is not the only
advantage of CORE over other codes: since CORE uses simple
XOR parity for cross-object encoding, the computational
cost of repairs can also be significantly lower.

The main contributions of this work are as follows:
(i) While this work builds on the preliminary observation
in [1] (and more generally, on the idea of product codes
[25]), from the theoretical perspective the main contribution
of this paper is a rigorous analysis and confirmation
of the intuitions advanced in [1], and a comparison
with traditional erasure codes as well as a novel
storage-centric code, namely the local reconstruction code
used in Azure. (ii) A more practical contribution is from the
systems research perspective, where we implement the ideas
to build the general-purpose block-level storage primitive
CORE, and integrate it with a popular distributed
file system, HDFS (available at [12]). (iii) In the process,
we identify a few ways to optimize the HDFS-RAID [18]
implementation, on which we build CORE.
(iv) We design novel algorithms to understand the failure
pattern and exploit the greater flexibility afforded by
CORE's code design, in order to achieve fast and cheap
repairs. The implementation is meticulously tested and
benchmarked in a proprietary cluster as well as on EC2.

To repair single failures, CORE consumes 50% less
bandwidth and is between 43% and 76% faster compared
to the classic erasure code. In case of double failures, in
the worst-case scenario (when both failed blocks belong
to the same file), it consumes 16% less bandwidth
and is 13% to 59% faster. We thus hope that CORE's
design and analysis are not only academically interesting,
but that the performance boost it achieves, and the fact that
it is based on a simple composition of existing mature
techniques (instead of relying on proprietary techniques
or other untested novel approaches), make it a serious
candidate for wide-scale adoption.

2 Related Work 

Erasure codes have long been explored as a storage-efficient
alternative to replication for achieving fault-tolerance [17]
in the peer-to-peer (P2P) systems literature, and have led to
numerous prototypes, e.g., OceanStore [19] and TotalRecall [26],
to name a few. In recent years erasure codes have gained
traction [27] even in mainstream storage technologies such as
RAID [6]. The ideas from RAID systems are in turn permeating to
Cloud settings [9, 24], and erasure codes have become
an integral part of many proprietary file systems used in
data-centers |7^,^, as well as open-source variants [18].

With the proliferation of erasure codes in storage-centric
applications, there has been a corresponding rise
in the exploration of novel erasure codes which cater to
the nuances of distributed storage systems. Specific aspects
that have been investigated in designing such new
coding techniques include: (i) efficient degraded data
access [10, 11], (ii) good repairability | [2p6| by either
combining standard codes fTppS], applying network coding
techniques [2, 4, 20], or designing completely new codes
with lower repair fan-in p3}|T5|, and (iii) fast creation
of erasure coded redundancy [21, 22].

Despite the plethora of works investigating novel erasure
codes, most existing distributed file systems using
erasure codes do so by adapting traditional erasure
codes. Microsoft's Windows Azure Storage [8] is a
prominent exception which uses an optimized version of
Pyramid codes [10] called Local Reconstruction Code
(LRC) [11]. Two recent academic prototypes, NCFS
[28] and NCCloud [29], likewise explore the feasibility
of applying network coding techniques for repairing lost
data. In contrast to these systems based on novel erasure
codes, CORE composes two mature techniques (standard
erasure codes and RAID-4-like parity) based on a very
simple and thus easy to realize design, while achieving
very good repairability and degraded read performance.

Although there has been a lot of interest in designing
new erasure codes with good repairability and degraded
read properties for data-center storage infrastructure,
Windows Azure Storage is the only operational system,
to the best of our knowledge, that integrates one
of these new codes [11]. In contrast, CORE does not
aim to present a new coding technique per se, but composes
existing mature techniques to achieve very good
repairability, while keeping the system design extremely
simple (our HDFS-integrated implementation is available
at [12]). This makes CORE the first general-purpose erasure
coding based storage primitive designed with the
specific goal of good repairability and degraded read
performance, in addition to resilience against faults, which
is the sole design goal for which erasure codes are used
in existing systems like HDFS-RAID [18].

3 Background 

We next provide some background on what erasure
codes are and how they are used in distributed storage
systems, followed by a discussion on local repairability,
which has come to the fore in the design of novel
storage-centric erasure codes. Finally, we discuss the code used
in the Azure system, which, to the best of our knowledge,
was the first deployment of repairable codes in a
large-scale commercial cloud storage system.

3.1 Classic Erasure Codes 

Traditionally, large data objects have been stored by
splitting them into blocks of (say) size q bits, which are
then replicated across multiple storage nodes. In contrast,
an $(n, k)$ erasure code takes k different data blocks
of size q, and computes $m = n - k$ parity blocks of the
same size, each to be stored in a different storage node.
Then, in the event of disk failures, the k original blocks
can be reconstructed by collecting and decoding a subset
of $k' \ge k$ blocks out of the total n stored blocks.

Consider that the vector $\mathbf{o} = (o_1, \dots, o_k)$ denotes a data
object composed of k blocks of q bits each. That is, each
block $o_i$ is a string of q bits. The encoding operations
are performed using finite field arithmetic, where the two
bits {0, 1} form a finite field $\mathbb{F}_2$ of two elements, while
$o_i$ likewise belongs to the binary extension field $\mathbb{F}_{2^q}$
containing $2^q$ elements. Then, the encoding of the object
$\mathbf{o}$ is a linear transformation defined by a $k \times n$ generator
matrix G such that we obtain an n-dimensional codeword
$\mathbf{c} = (c_1, \dots, c_n)$ of size $n \times q$ bits by applying the
linear transformation $\mathbf{c} = \mathbf{o} \cdot G$. A code with such a
generator matrix G is usually referred to as an $(n, k)$-code.
When the generator matrix G has the form $G = [I_k, G']$,
where $I_k$ is the $k \times k$ identity matrix and $G'$ is a $k \times m$ matrix
($m = n - k$), the codeword becomes $\mathbf{c} = [\mathbf{o}, \mathbf{p}]$, where $\mathbf{o}$
is the original object and $\mathbf{p}$ is a parity vector containing
$m \times q$ parity bits. The code is then said to be systematic,
in which case the k parts of the original object remain
unaltered after the coding process. The main advantage of
systematic codes is that the original data $\mathbf{o}$ can be
accessed without requiring a decoding process, by just
reading the systematic blocks of $\mathbf{c}$.

The above encoding process stretches the original data
by a factor of $n/k$ (a ratio known as the stretch factor),
occupying $n/k$ times more storage space than the size of
the original object. By choosing a suitable code with a
stretch factor satisfying $n/k < r$, significant storage space
savings can be achieved in comparison to a system using
r replicas. Finally, an optimal erasure code in terms
of the trade-off between storage overhead and fault tolerance
is called a maximum distance separable (MDS)
code, and has the property that the original object $\mathbf{o}$ can
be reconstructed from any k out of the total $n = k + m$
stored blocks (i.e., $k' = k$), tolerating the loss of any
arbitrary $m = n - k$ blocks. The fault-tolerance of MDS
erasure codes has been variously analyzed and compared
with replication [17, 26], providing guidelines to choose
suitable code parameters n and k for a desired level of
resilience under an expected level of failures of individual
storage nodes.

3.2 Locally Repairable Codes 

A critical drawback of MDS codes is their high reconstruction
cost. Repairing or reading a single failed block
requires downloading an amount of information equivalent
to the size of the whole data object $\mathbf{o}$, which is k
times larger than the amount of data being repaired or read.

Since repairs and degraded reads are frequent in storage
systems, several recent works [10, 11, 13, 15] have
looked at reducing the number of blocks needed to carry
out the repair/reconstruction of an inaccessible block
(which is needed for both access and repair). Such a
property is achieved by introducing 'local dependencies'
among encoded blocks, and can be called repair locality.

Local repairability is achieved when a block $c_i$ can
be expressed as a linear combination of d ($d < k$) other
blocks, $c_i = a_1 c'_1 + a_2 c'_2 + \dots + a_d c'_d$, where $c'_j \in \mathbf{c}$
s.t. $c'_j \neq c_i$, and where the coefficients $a_j \in \mathbb{F}_{2^q}$ have
predetermined values. This local repairability property
reduces the number of blocks accessed and transferred
during degraded reads or repairs from k to d, where d
can be as small as $d = 2$ [14, 15]. Unfortunately, achieving
such code locality leads to poorer fault-tolerance for
a given storage overhead in comparison to MDS codes.
Hence, the design of such codes poses a trade-off between
three important desirable system properties: (i)
high fault-tolerance, (ii) low storage overhead, and (iii)
efficient repairs and degraded reads.

3.3 Local Reconstruction Code in Azure 

Local Reconstruction Code (LRC) [11], used in the Azure
system, is an instance of Pyramid codes [10] optimized
to achieve a good trade-off among these desirable properties.
In its simplest form, an $(n, k)$ LRC code (for even
k, and $n \ge k + 2$) is a code composed of two classic
optimal (MDS) erasure codes: (i) a systematic $(n - 2, k)$-code
with a global generator matrix $G_g$ of the form
$G_g = [I_k, H]$, and (ii) a systematic $(k' + 1, k')$-code with
local generator matrix $G_l$ of the form $G_l = [I_{k'}, \mathbf{1}^{\top}]$, for
$k' = k/2$. Then, the LRC encoding process consists of
splitting the original data vector $\mathbf{o}$ into two equal-sized
vectors $\mathbf{o} = (\mathbf{o}_1, \mathbf{o}_2)$ (recall that k is even) and performing
three independent encoding operations:

$$
\begin{aligned}
\mathbf{c}_g &= \mathbf{o} \cdot G_g = [\mathbf{o}, \mathbf{p}_g], \\
\mathbf{c}_1 &= \mathbf{o}_1 \cdot G_l = [\mathbf{o}_1, \mathbf{p}_1], \\
\mathbf{c}_2 &= \mathbf{o}_2 \cdot G_l = [\mathbf{o}_2, \mathbf{p}_2].
\end{aligned}
$$

Then, the final codeword $\mathbf{c}$ of the LRC code is obtained
by concatenating the original data with the parity vectors of
the three previous independent encodings:
$\mathbf{c} = [\mathbf{o}_1, \mathbf{o}_2, \mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_g]$.

The local reconstruction property of this composed
code can be shown as follows. When a single codeword
block $c_i \in \mathbf{c}$ is missing, it can be repaired by reconstructing
either $\mathbf{o}_1$ (if $c_i \in \mathbf{c}_1$) or $\mathbf{o}_2$ (if $c_i \in \mathbf{c}_2$), and
regenerating the missing block $c_i$ from it. This repair mechanism
entails transferring $k' = k/2$ redundancy blocks over the
network, half of the traffic required by an MDS $(n, k)$
erasure code. However, when the missing block $c_i$ does
not belong to $\mathbf{c}_1$ or $\mathbf{c}_2$ (i.e., $c_i \in \mathbf{p}_g$), the repair cannot
exploit the local reconstruction property and has to repair
$c_i$ using the global code, transferring k redundancy
blocks. This means that LRC codes can locally repair only
$2(k' + 1) = k + 2$ blocks out of the total n blocks
in the codeword. The remaining $n - k - 2$ blocks have to
be repaired by downloading k blocks.

Figure 1: Example of a (10,6) LRC code encoding an object
$\mathbf{o} = (\mathbf{o}_1, \mathbf{o}_2) = (o_1, \dots, o_6)$. An MDS (4,3) erasure code (the
local code) generates two parity blocks ($p_{1,1}$ and $p_{2,1}$), and
another MDS (8,6)-code (the global code) generates the other two
parity blocks ($p_{g,1}$ and $p_{g,2}$).

Figure 2: Example of a simple product code. The blue parity
blocks are generated using a horizontal (5,3) MDS code,
whereas the green blocks are simple parity checks of each
column (or a (3,2) code).

In Fig. 1 we depict an example of how this (10,6) LRC
code encodes the different systematic blocks from two
different sub-objects $\mathbf{o}_1$ and $\mathbf{o}_2$. Repairing any block
from $\mathbf{o}$, $\mathbf{p}_1$ or $\mathbf{p}_2$ can be done by accessing three other
blocks, e.g., $o_{1,2} = o_{1,1} + o_{1,3} + p_{1,1}$. However, repairing
a block from $\mathbf{p}_g$ requires accessing six other blocks.

Besides being able to repair single block failures,
LRC can also repair the simultaneous failure of multiple
blocks. LRC can repair any combination of $n - k - 2$
failures using the global code, and can repair up to $n - k$
simultaneous failures by combining local and global
repairs. On average, repairing any one missing block in
LRC requires

$$
\left(\frac{k+2}{n}\right)\frac{k}{2} + \left(\frac{n-k-2}{n}\right)k = \frac{2kn - k^2 - 2k}{2n}
$$

blocks. In Section 5 we will analyze in more detail the
resiliency of LRC to multiple block failures.
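
As a quick numerical check of the average single-block repair cost of LRC, the weighted sum can be evaluated directly. The following is a minimal sketch; the function name `lrc_avg_single_repair_blocks` is ours (hypothetical), and the model assumes, as in the text, that k + 2 blocks are locally repairable from k/2 blocks while the remaining n - k - 2 blocks need k blocks each.

```python
from fractions import Fraction


def lrc_avg_single_repair_blocks(n, k):
    """Average number of blocks read to repair one uniformly chosen block.

    Assumes the simple (n, k) LRC layout described in the text: k + 2 blocks
    (data plus the two local parities) are repaired from k/2 blocks, and the
    n - k - 2 global parities are repaired from k blocks.
    """
    if k % 2 != 0 or n < k + 2:
        raise ValueError("expected even k and n >= k + 2")
    local_part = Fraction(k + 2, n) * Fraction(k, 2)
    global_part = Fraction(n - k - 2, n) * k
    return local_part + global_part


# The (10,6) LRC code of Fig. 1: 18/5 = 3.6 blocks on average,
# compared with k = 6 blocks for an MDS code.
print(lrc_avg_single_repair_blocks(10, 6))
```

Exact rational arithmetic is used only so that the result can be compared against the closed form $(2kn - k^2 - 2k)/2n$ without rounding concerns.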

4 Cross-Object Redundancy 

The reconstruction locality of LRC used in Azure offers
good read performance under degraded system conditions.
However, Pyramid codes [10] (the framework behind
LRC) were not originally conceived for efficient
repairs per se, and as shown in the above example, not all
blocks can be efficiently repaired. Although LRC codes
can significantly reduce the repair traffic as compared to
MDS codes, there are still cases where data locality cannot
be fully exploited to repair arbitrary missing blocks.

We next explore how product codes [25] can achieve
good repairability without compromising either the degraded
read performance or the fault-tolerance of the
code. Specifically, by combining a long and a short linear
erasure code, we realize a product code with high fault
tolerance (provided by the long code) and high repair
locality (provided by the short code). This is achieved
by encoding multiple already-encoded objects together
(or cross-object encoding), thus reusing existing
encoding/decoding/repair mechanisms already deployed in a
distributed storage system, facilitating an organic
integration of the approach.

Example 1. Suppose that we have two different data objects
$\mathbf{o}_1 = (o_{11}, o_{12}, o_{13})$ and $\mathbf{o}_2 = (o_{21}, o_{22}, o_{23})$ to be
encoded with a (5,3) systematic MDS erasure code (with a
generator matrix $G_o$). Then, we obtain the codewords:

$$
\begin{aligned}
\mathbf{c}_1 &= \mathbf{o}_1 \cdot G_o = (o_{11}, o_{12}, o_{13}, p_{11}, p_{12}), \\
\mathbf{c}_2 &= \mathbf{o}_2 \cdot G_o = (o_{21}, o_{22}, o_{23}, p_{21}, p_{22}).
\end{aligned}
$$

By grouping symbols from $\mathbf{c}_1$ and $\mathbf{c}_2$ on a per-column
basis, we obtain the set of vectors

$$
\mathcal{X} = \{(o_{11}, o_{21}), (o_{12}, o_{22}), (o_{13}, o_{23}), (p_{11}, p_{21}), (p_{12}, p_{22})\}.
$$

We then encode each vector $\mathbf{x}_i \in \mathcal{X}$ (cross-object encoding)
with a (3,2) systematic code (a simple parity check
code, or SPC), with generator matrix $G_g = [I_2, \mathbf{1}_2^{\top}]$,
where $I_2$ is the $2 \times 2$ identity matrix, and $\mathbf{1}_2$ is a vector with two
ones. For each $\mathbf{x}_i \in \mathcal{X}$ we obtain $\mathbf{x}_i \cdot G_g = [\mathbf{x}_i, p_{g,i}]$,
where $p_{g,i} = \mathbf{x}_i \cdot \mathbf{1}_2^{\top}$. The vector with all the cross-object
parity blocks, $\mathbf{p}_g = (p_{g,1}, \dots, p_{g,5})$, contains:

$$
\mathbf{p}_g = (o_{11} \oplus o_{21},\ o_{12} \oplus o_{22},\ o_{13} \oplus o_{23},\ p_{11} \oplus p_{21},\ p_{12} \oplus p_{22}).
$$

In Fig. 2 we depict this two-phase encoding process.
Note that $\mathbf{p}_g$ can be viewed as the Reed-Solomon encoding
of the respective parities of the systematic symbols.
We refer to a code with a generator matrix G that
takes a composed data object $\mathbf{o} = (\mathbf{o}_1, \mathbf{o}_2)$ and encodes
it to a codeword $\mathbf{c} = \mathbf{o} \cdot G = [\mathbf{c}_1, \mathbf{c}_2, \mathbf{p}_g]$ as the product
code of $G_g$ and $G_o$. It is easy to see how this example
product code repairs any single missing block by using
the outer erasure code $G_g$, e.g., we can repair $o_{11}$ using
$o_{11} = o_{21} + p_{g,1}$. In addition, in case of more than one
failure per "column", the code still has the opportunity
to repair up to two failures per codeword $\mathbf{c}_i$, and up to
two failures within the additional parity vector $\mathbf{p}_g$.
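
The vertical (cross-object) repair of Example 1 can be illustrated with plain byte strings. This is only a sketch under stated assumptions: the block contents are hypothetical, the horizontal parities $p_{11}, \dots, p_{22}$ are placeholders rather than real Reed-Solomon output, and `xor_blocks` is a helper name introduced here for illustration.

```python
def xor_blocks(*blocks):
    """Bytewise XOR of one or more equally sized blocks."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same size")
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# Two already-encoded (5,3) codewords; the last two blocks of each stand in
# for horizontal parities (placeholder bytes, not real Reed-Solomon output).
c1 = [b"o11!", b"o12!", b"o13!", b"p11!", b"p12!"]
c2 = [b"o21!", b"o22!", b"o23!", b"p21!", b"p22!"]

# Cross-object parity: one XOR block per column.
pg = [xor_blocks(a, b) for a, b in zip(c1, c2)]

# Losing o11 is repaired from the two surviving blocks of its column.
repaired_o11 = xor_blocks(c2[0], pg[0])
assert repaired_o11 == c1[0]
```

Only two blocks are read for the repair, regardless of the parameters of the horizontal code, which is the source of the repair locality discussed above.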

Definition of CORE's Product Code: Let $G_c$ and $G_o$
respectively be the generator matrices of an $(n_c, k_c)$ and
an $(n_o, k_o)$ code. Then, the product code of $G_c$ and
$G_o$ is an $(n_c n_o, k_c k_o)$ linear code with generator matrix
$G = G_c \otimes G_o$, where the operator $\otimes$ represents the
Kronecker product. In the case of the product code used
in CORE, we consider that the single parity check
(SPC) code with generator matrix $G_c = [I_t, \mathbf{1}_t^{\top}]$, i.e., a $(t + 1, t)$
MDS erasure code over $\mathbb{F}_{2^q}$, is the vertical code. For an
input $\mathbf{o} = (o_1, \dots, o_t)$, $o_i \in \mathbb{F}_{2^q}$, this code generates a
systematic codeword $\mathbf{c} = (c_1, \dots, c_{t+1}) = (o_1, \dots, o_t, c_{t+1})$,
where $c_{t+1} = \sum_{i=1}^{t} o_i$. Since $\mathbb{F}_{2^q}$ is a binary extension
field, the last symbol in the codeword corresponds to the
exclusive-or (XOR) of the t original symbols. It can
repair any single erasure in the codeword by XORing the
remaining t symbols. The inner (horizontal) code $G_o$
used in CORE is an MDS $(n, k)$ erasure code. For the sake
of simplicity, we consider that it is an $(n, k)$ Reed-Solomon
code with generator matrix $G_o = [I_k, H]$, where
H is a $k \times m$ Vandermonde matrix (recall that $m = n - k$),

$$
H = \begin{pmatrix}
1 & a_1 & \cdots & a_1^{m-1} \\
\vdots & \vdots & & \vdots \\
1 & a_k & \cdots & a_k^{m-1}
\end{pmatrix},
$$

for any $a_i \in \mathbb{F}_{2^q}$. Then, CORE's product code is a linear
code that cross-encodes t different data objects using
a generator matrix $G = G_c \otimes G_o$. We will refer to such a
code as an $(n, k, t)$ CORE product code.
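
To make the Kronecker-product definition concrete, the sketch below builds $G = G_c \otimes G_o$ for a toy case. It is an illustration under loud assumptions: both codes are binary single parity check codes over GF(2), so the horizontal (4,3) code is only a stand-in for the Reed-Solomon code used in CORE, and the helper names (`spc_generator`, `kron`, `encode`) are ours.

```python
def spc_generator(t):
    """Generator matrix [I_t | 1^T] of a (t+1, t) single parity check code over GF(2)."""
    return [[1 if col in (row, t) else 0 for col in range(t + 1)] for row in range(t)]


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def encode(o, g):
    """Codeword c = o * G, with arithmetic modulo 2."""
    return [sum(oi * gij for oi, gij in zip(o, column)) % 2 for column in zip(*g)]


g_c = spc_generator(2)  # vertical (3,2) SPC code, i.e. t = 2
g_o = spc_generator(3)  # horizontal (4,3) code: binary stand-in for the MDS code
g = kron(g_c, g_o)      # generator matrix of the resulting (12,6) product code

o = [1, 0, 1, 1, 1, 0]  # two objects of three bits each (hypothetical data)
c = encode(o, g)

# Read row by row, the codeword is a 3 x 4 grid: the two horizontally encoded
# objects followed by the row of cross-object (vertical) parities.
grid = [c[r * 4:(r + 1) * 4] for r in range(3)]
for row in grid:
    print(row)
```

With this ordering, the first two rows of the grid are the individual codewords and the last row is their column-wise XOR, matching the two-phase encoding of Example 1.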

5 Analysis of CORE's Product Code

In this section we evaluate CORE's product code in terms
of its (i) fault-tolerance, (ii) repair traffic, and (iii)
efficiency of reading data in degraded situations, and
compare it with MDS erasure codes and LRC codes.

5.1 Fault Tolerance 

One way to compare the data reliability of different erasure
codes is to measure the amount of data lost (non-repairable
objects) when a fraction p of all storage nodes
fail simultaneously. In a large distributed storage system
(with thousands of nodes) the average number of objects
lost is $N \times (1 - \pi)$, where N is the total number
of stored objects and $\pi$ is the probability of being able to
repair any single object when each stored block is
independently inaccessible with probability p. This probability
$\pi$ is used as a metric to quantify the fault-tolerance of a
code and is called its static resilience.

MDS Erasure Codes: The static resilience of an MDS
$(n, k)$ erasure code, denoted as $\pi_e$, can be measured as the
probability that at most $m = n - k$ redundant blocks are
inaccessible, i.e., $\pi_e = \Pr(B(n, p) \le m)$, where $B(n, p)$
represents a Binomial variate describing the number of
inaccessible blocks in a set of n blocks when nodes are
inaccessible with probability p, hence,

$$
\Pr(B(n, p) \le m) = \sum_{i=0}^{m} \binom{n}{i} p^i (1 - p)^{n-i}.
$$



LRC Codes: The static resilience of an $(n, k)$ LRC code,
denoted as $\pi_L$, can be computed as:

$$
\begin{aligned}
\pi_L = {} & \Pr(B(n, p) \le m - 2) \; + \\
& \Pr(B(n, p) = m - 1) \cdot 2\theta(1 - \theta) \; + \\
& \Pr(B(n, p) = m) \cdot (1 - \theta)^2.
\end{aligned}
$$

The first summand of the expression represents the probability
that at most $n - k - 2 = m - 2$ blocks are inaccessible,
and thus, data can always be reconstructed
using the global erasure code. The second summand
represents the probability that $m - 1$ blocks are unavailable
but one failure can be repaired using one
of the local groups. Given the probability that at
most one block per local coding group is unavailable,
$\theta = (k/2 + 1)\,p\,(1 - p)^{k/2}$, the coefficient $2\theta(1 - \theta)$
represents the probability of having only one local
coding group with at most one unavailable block. Similarly,
the third summand represents the probability that m
blocks are unavailable and at most one block is unavailable
in each local coding group.

CORE's Product Code: Unlike MDS and LRC codes,
measuring the static resiliency of product codes in general
is an open problem. Muqaibel [5] showed the complexity
of measuring it and provided a closed-form expression
to measure the static resiliency of product codes
for a specific scenario ($m = n - k = 1$ and $t = 1$) with
small code sizes (below 8), and an upper bound of $\pi_c$ for
larger ones. However, for the specific class of codes used
in CORE (one dimension uses the simple parity code),
a lower bound of the resilience $\pi_c$ can be obtained by
treating a column as lost whenever it contains more than
one inaccessible block, and requiring at most m lost columns. Thus,
$\pi_c \ge \Pr(B(n, 1 - \vartheta) \le m)$, where $\vartheta = \Pr(B(t + 1, p) \le 1)$
is the probability that there is at most one inaccessible
block per column [5], which can be rewritten as:

$$
\pi_c \ge \sum_{i=0}^{m} \binom{n}{i} (1 - \vartheta)^i \, \vartheta^{n-i}, \quad \text{where} \quad
\vartheta = (1 - p)^{t+1} + (t + 1)\,p\,(1 - p)^t.
$$

In Fig. 3 we depict the resilience of the three analyzed
codes corresponding to a stretch factor of roughly 1.4x,
i.e., when $n/k \approx 1.4$. The static resilience $\pi$ is often
represented using the 'number of nines', where, e.g., $\pi \ge 0.999$
represents a static resilience of three nines (in general,
$\mathrm{nines}(\pi) = \log_{10}(1 - \pi)^{-1}$). We see that for equivalent
storage overhead, CORE's product codes achieve a
resilience comparable to that of MDS codes, while the
static resilience of LRC codes degrades significantly.
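
The closed-form expressions for the MDS static resilience and for the lower bound on CORE's resilience are straightforward to evaluate numerically. The sketch below is an illustration only: the function names and the example parameters (n = 9, k = 6, t = 3) are our assumptions and are not the configurations plotted in Fig. 3, and the bound treats a column as lost with probability $1 - \vartheta$.

```python
import math


def binom_cdf(n, p, m):
    """Pr(B(n, p) <= m) for a Binomial variate with n trials and probability p."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(m + 1))


def mds_resilience(n, k, p):
    """Static resilience of an (n, k) MDS code: at most m = n - k blocks inaccessible."""
    return binom_cdf(n, p, n - k)


def core_resilience_lower_bound(n, k, t, p):
    """Lower bound on the static resilience of an (n, k, t) CORE product code.

    theta is the probability that a column of t + 1 blocks has at most one
    inaccessible block; a column is counted as lost with probability 1 - theta.
    """
    theta = (1 - p) ** (t + 1) + (t + 1) * p * (1 - p) ** t
    return binom_cdf(n, 1 - theta, n - k)


def nines(pi):
    """Number of nines of a resilience value pi < 1, e.g. 0.999 gives about 3."""
    return -math.log10(1 - pi)


# Hypothetical parameters, chosen only for illustration.
for p in (0.01, 0.1):
    print(p, nines(mds_resilience(9, 6, p)), nines(core_resilience_lower_bound(9, 6, 3, p)))
```

Note that `nines` is undefined for a resilience of exactly 1, so very small values of p may need higher-precision arithmetic than double-precision floats provide.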

5.2 Repair Traffic Requirements 

We next compare the repair traffic requirements of (i) an
MDS $(n, k_e)$ erasure code (denoted as EC), (ii) an $(n, k_l)$
LRC code as described in Section 3.3, and (iii) an $(n, k_c, t)$
product code as described in Section 4, with t set to
$k_l/2$. The choice of cross-object coding parameter $t = k_l/2$
allows a fair comparison between CORE codes and
LRC, since in both cases all the systematic pieces can
then be repaired by contacting exactly $k_l/2$ nodes.

Figure 3: Static resilience $\pi$ (in number of nines) of the MDS
Reed-Solomon code (RS), the Local Reconstruction Code (LRC),
and (lower bound for) CORE's product code as a function of
the block/node unavailability probability p.

Repairing a Single Block Failure: As discussed in
Section 3.1, MDS erasure codes require the transfer of
k blocks to repair even a single failure. In Section 3.3
we showed that the average repair cost of single block
failures in LRC is reduced to $(2kn - k^2 - 2k)/2n$. However,
in the case of CORE's product code, any single
failure can be repaired by transferring only t blocks. For
$t = k/2$, the LRC repair cost is larger than that of CORE,
provided that the stretch factor is lower than two.

Repairing Multiple Block Failures: Though repairing
a single block failure requires less traffic in CORE's
product codes than in LRC and MDS codes, we cannot
use the single failure model to fairly compare the repair 
traffic in CORE. Local repairs in CORE work only when 
there is at most one failure per encoding group column. If 
multiple failures occur in the same column, these failures 
cannot be repaired using the "vertical" parity, increasing 
the average repair traffic. For a fairer comparison, we 
determine the average cost of repairing all affected data 
objects when a fraction p of the storage nodes fail. 

Consider the random variable W representing the network
traffic required to repair a data object when each of
the redundant blocks of the object might fail with probability
p. Then, we can compare the repair costs of the
different codes by measuring the conditional probability
$\Pr(W \mid \Pi)$, where $\Pi$ is the event that a given data object
can be repaired when a fraction p of nodes fails. As discussed
in Section 5.1, for large systems we can assume
that $\Pr(\Pi) = \pi$, where $\pi$ is the code's static resilience.

In CORE and LRC the traffic required to repair a
specific failure pattern depends on the number of failed
blocks, so we express the previous probability conditioned
on the number of failures,

$$
\Pr(W = w \mid \Pi) = \sum_{i=0}^{n} \Pr(W = w \mid B(n, p) = i, \Pi) \cdot \Pr(B(n, p) = i).
$$

To compare the performance of different codes we will
measure the expected value and variance of W. Using the
laws of total expectation and total variance:

$$
\begin{aligned}
E(W \mid \Pi) &= \sum_{i=0}^{n} E(W \mid B(n, p) = i, \Pi) \cdot \Pr(B(n, p) = i), \\
\mathrm{Var}(W \mid \Pi) &= E(W^2 \mid \Pi) - E(W \mid \Pi)^2.
\end{aligned}
$$

For each of the evaluated codes, the expression
$E(W \mid B(n, p) = i, \Pi)$ is measured numerically using a
Monte-Carlo experiment, where, at each iteration, i random
blocks out of the total n blocks fail. We measure
$E(W^2 \mid \Pi)$, required to obtain $\mathrm{Var}(W \mid \Pi)$, similarly. We
normalize the network traffic values by the size of the
whole stored object, i.e., $kq$ bits.

Figure 4: Average network traffic (normalized) required to repair
a data object when a fraction p of the storage nodes fails.

Similarly, we determine the time required to
repair each failure pattern, obtaining $E(T \mid \Pi)$ and
$\mathrm{Var}(T \mid \Pi)$, where T is the random variable describing the
repair time. It is measured assuming a congestion-free network,
but with end nodes with a limited bandwidth capacity, so that
delays in repair occur when a single
node sends/receives multiple blocks. We normalize the
repair time T by the time a node requires to download a
whole data object (k blocks) from another node.
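
The Monte-Carlo estimation of $E(W \mid \Pi)$ described above could be sketched as follows. This is not the evaluation code used for the figures: the callback interface `repair_cost` and the simple MDS cost model (k blocks transferred whenever at least one block of a repairable pattern has failed) are assumptions made purely for illustration, and a real comparison would plug in a per-code repair schedule.

```python
import math
import random


def binom_pmf(n, i, p):
    """Pr(B(n, p) = i)."""
    return math.comb(n, i) * p**i * (1 - p) ** (n - i)


def expected_repair_traffic(n, p, repair_cost, trials=2000, seed=0):
    """Estimate E(W | Pi) by conditioning on the number of failed blocks.

    repair_cost(failed) returns the traffic (in blocks) needed to repair the
    set of failed block indices, or None when the pattern is not repairable.
    """
    rng = random.Random(seed)
    total = 0.0
    for i in range(n + 1):
        costs = []
        for _ in range(trials):
            failed = frozenset(rng.sample(range(n), i))  # i random failed blocks
            cost = repair_cost(failed)
            if cost is not None:  # keep only repairable patterns
                costs.append(cost)
        if costs:
            total += (sum(costs) / len(costs)) * binom_pmf(n, i, p)
    return total


def mds_repair_cost(n, k):
    """Hypothetical cost model for an (n, k) MDS code: k blocks if anything failed."""
    def cost(failed):
        if len(failed) > n - k:
            return None  # more than m = n - k failures: not repairable
        return k if failed else 0
    return cost


n, k = 9, 6
for p in (0.01, 0.1):
    traffic = expected_repair_traffic(n, p, mds_repair_cost(n, k), trials=200)
    print(p, traffic / k)  # normalized by the object size (k blocks)
```

The same skeleton yields $E(W^2 \mid \Pi)$ by averaging squared costs, from which the variance follows.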

Multiple Failures Evaluation: To numerically com- 
pare the network traffic W and the repair time T of each 
code, we evaluate each metric for different n and k pa- 
rameter combinations. For each stretch factor n/k value, 
we choose the best (minimum) network traffic and (min- 
imum) repair time achieved by each code. 

Fig. 4 depicts the average traffic required to repair a
data object when a fraction p of storage nodes fail. This
was evaluated for $p \in \{0.01, 0.1\}$, respectively representing
concurrent failure of 1% and 10% of the nodes. For
both low and high failure probabilities, CORE's product
code and LRC require comparable repair traffic (LRC
outperforms slightly). However, from Fig. 5 we see how
CORE's product code can reduce the repair time by an
order of magnitude, since "vertical" repairs can be
concurrently and independently executed. Having short repair
times means that the storage system recovers faster
from an unsafe state, which in turn increases the system's
robustness. The system is thus likely to operate
in configurations corresponding to low p values.

Figure 5: Average time (normalized) required to repair a data
object when a fraction p of the storage nodes fails.

5.3 Degraded Read Performance 

Finally, we compare the read performance of the differ- 
ent codes when data is accessed in a degraded system 
state, i.e., when some storage nodes and the correspond- 
ing data are (temporarily) unavailable, possibly because 
of the nodes being overloaded, network outage, or repair 
actions being yet to be carried out. We consider two dif- 
ferent typical data access scenarios: 
Centralized Degraded Read: A single node in the sys- 
tem aims to retrieve a whole data object. Such an access 
pattern may arise in applications like video-on-demand. 
Distributed Degraded Read: The object is read by k
nodes, each accessing a different systematic block. Such an
access pattern is typical in a MapReduce application
where k different mappers may parse k different blocks.

As earlier, the codes were studied for varying stretch
factors n/k. In Fig. 6 we depict the average traffic
required to retrieve a single object in the centralized
degraded read experiments. When the fraction of unavailable
nodes is low (p = 0.01), all three codes can successfully
retrieve the stored object without any additional overhead,
transferring only an amount of data equivalent to the
stored object. However, when the fraction of unavailable
nodes is high (p = 0.1), CORE's product code has to
retrieve some extra data for the low stretch factor
configurations. Low stretch factors in CORE codes impose
an extra read overhead due to the number of extra blocks
that have to be retrieved during the repair of the
unavailable systematic blocks. In Fig. 7 we depict the
average read traffic in the decentralized read
experiments, where k distributed processes each read a



Figure 6: Traffic (normalized) required to read a stored object 
in the centralized degraded read experiment. We depict the re- 
sults for two different fractions p of unavailable nodes. 
Figure 7: Traffic (normalized) required to read a stored object
in the decentralized degraded read experiment. We depict the
results for two different fractions p of unavailable nodes.

different systematic block. All the codes behave similarly
for low p. However, for larger p, CORE's product codes
achieve performance similar to that of LRC, while
traditional erasure codes require slightly more traffic
than the other two codes.

From these macroscopic experiments, we see that for
most realistic system configurations (e.g., a code stretch
factor between 1.5 and 2) and states (low p values), CORE
achieves much better repairability, while having read
performance similar to that of local reconstruction codes.
Later, in Section 8, we carry out microscopic experiments
with our actual implementation, studying specific fault
patterns within an individual cross-object coded group, to
further demonstrate the advantages of using CORE.

6 CORE Recovery Algorithms 

In CORE, there are a number of choices to be made during
recovery. For example, one has to choose between
horizontal and vertical repair when both options are
available for a given failure instance. Moreover, since
recoverability in CORE depends not only on the number
of failures, but also on their distribution in CORE's
two-dimensional arrangement, a recoverability-checking
algorithm is needed.

We adopt a divide-and-conquer approach to tackle
these issues. Specifically, given a matrix representing
the available and failed nodes (subsequently called the
CORE matrix), we first split the failures into
'independent clusters' (defined below). The other
algorithms, i.e., recoverability checking and repair
scheduling, can then be performed within each cluster.
We discuss these next.

6.1 Identifying Independent Clusters 

We define disjoint subsets of failed nodes that can be
handled without interference as independent clusters.^
Essentially, two different clusters should not share any
common row or column containing failed nodes. Two
important benefits of such clusters are that (i) they allow
parallel repairs, and (ii) they may allow partial recovery
when the full CORE matrix is not recoverable.

A naive way to create the clusters is as follows.
Initially, each single failure is considered a cluster. Two
clusters are then merged if there exists at least one
common row or column on which both clusters have a failure.
The process continues until there are no mergeable
clusters left. The number of clusters in a CORE matrix
is at most m (the number of rows). To investigate the
distribution of the number of clusters based on the number
of failures, we ran our clustering algorithm on 10M
randomly generated failure matrices for code parameters
(14,12,5) and varied the number of random failures from
1 to 20 (result shown in Figure 8a).
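
The naive merging procedure described above can be sketched as follows. This is an illustrative sketch, not the code used in our implementation: the function name `independent_clusters` and the representation of a failure as a (row, column) pair are assumptions made for the example.

```python
from typing import List, Set, Tuple

Cell = Tuple[int, int]  # (row, column) of a failed block; assumed representation


def independent_clusters(failures: Set[Cell]) -> List[Set[Cell]]:
    """Group failures so that no two clusters share a failed row or column."""
    # Start with one cluster per failure.
    clusters: List[Set[Cell]] = [{f} for f in sorted(failures)]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            rows_i = {r for r, _ in clusters[i]}
            cols_i = {c for _, c in clusters[i]}
            for j in range(i + 1, len(clusters)):
                # Merge if cluster j has a failure on a row or column of cluster i.
                if any(r in rows_i or c in cols_i for r, c in clusters[j]):
                    clusters[i] |= clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan after every merge
    return clusters


if __name__ == "__main__":
    pattern = {(0, 0), (0, 1), (1, 1), (3, 5)}
    for cluster in independent_clusters(pattern):
        print(sorted(cluster))
```

The quadratic rescan after every merge keeps the sketch close to the textual description; a union-find structure keyed on rows and columns would yield the same clusters with less work.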



6.2 Recoverability-Checking Algorithm

For coding schemes that work at the level of single
objects, given the set of failures, one can directly infer
whether an object is recoverable. In the case of CORE,
however, this is more subtle. For instance, objects may
still be recoverable even if there are more than n - k
failed blocks within a single CORE row. We first identify
two bounds and then introduce an algorithm to determine
an object's recoverability.

The (Ir)Recoverability Bounds. For an (n,k,t) code:
• the lower bound of irrecoverability, L, is:

$$L = 2 \times (n - k + 1)$$

This occurs when two rows are minimally irrecoverable (each
has n - k + 1 failures) and the column indexes of their
failures are identical (i.e., no vertical repair is possible).
• the upper bound of recoverability, U, is:

$$U = t \times (n - k) + (2k - n) \times 1$$

This occurs when all rows are maximally recoverable
(each has n - k failures) and have identical failure
column indexes (i.e., the remaining k - (n - k) = 2k - n
columns can each tolerate a single failure).



These two bounds define an interval. For any number of
failures outside of this interval, the (ir)recoverability
can be decided immediately. More precisely, if the number
of failures is smaller than L, then the pattern is
recoverable (although, as we will see later, this is a very
pessimistic bound). Likewise, if the number is greater than
U, then it is certainly not recoverable.
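
As a quick numerical check of the two bounds, the following sketch evaluates them and applies the immediate decision rule. The function names are chosen for this example only, and the upper bound is computed under the assumption 2k >= n that the formula above implicitly makes.

```python
def irrecoverability_lower_bound(n: int, k: int) -> int:
    """L = 2 * (n - k + 1): two rows with n - k + 1 failures in the same columns."""
    return 2 * (n - k + 1)


def recoverability_upper_bound(n: int, k: int, t: int) -> int:
    """U = t * (n - k) + (2k - n) * 1, assuming 2k >= n."""
    return t * (n - k) + (2 * k - n) * 1


def quick_decision(num_failures: int, n: int, k: int, t: int) -> str:
    """Decide recoverability from the failure count alone, when possible."""
    if num_failures < irrecoverability_lower_bound(n, k):
        return "recoverable"
    if num_failures > recoverability_upper_bound(n, k, t):
        return "irrecoverable"
    return "depends on the failure distribution"


if __name__ == "__main__":
    n, k, t = 14, 12, 5
    print(irrecoverability_lower_bound(n, k))   # L for (14,12,5)
    print(recoverability_upper_bound(n, k, t))  # U for (14,12,5)
    print(quick_decision(10, n, k, t))
```

For the (14,12,5) configuration this evaluates to L = 6 and U = 20.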

For all the values within the above interval (inclusive),
the outcome depends on the distribution of the failures.
We propose a recursive algorithm that decides whether a
given CORE matrix with a specific failure pattern is
recoverable or not. At each step of the algorithm,
all the repaired and repairable rows/columns are removed
and the algorithm restarts with the reduced matrix as the
new input. If it results in an empty matrix, then the
pattern is recoverable; otherwise it is not.
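
A minimal sketch of this reduction is given below. It is written as a loop rather than as a recursion, and it assumes, in line with the bounds above, that a row is horizontally repairable when it has at most n - k failures and that a column is vertically repairable when it has at most one failure. The function name and the (row, column) representation of failures are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, Tuple

Cell = Tuple[int, int]  # (row, column) of a failed block; assumed representation


def is_recoverable(failures: Iterable[Cell], n: int, k: int) -> bool:
    """Repeatedly remove every failure that lies on a repairable row or column."""
    remaining = set(failures)
    while remaining:
        row_counts = Counter(r for r, _ in remaining)
        col_counts = Counter(c for _, c in remaining)
        fixable = {
            (r, c)
            for r, c in remaining
            if row_counts[r] <= n - k or col_counts[c] <= 1
        }
        if not fixable:
            return False  # no row or column can be repaired: stuck
        remaining -= fixable  # reduced matrix becomes the new input
    return True  # empty matrix: every failure was repaired


if __name__ == "__main__":
    # Three failures on one row of a (14,12) code: repairable vertically.
    print(is_recoverable({(0, 0), (0, 1), (0, 2)}, 14, 12))
    # Two rows with three failures each on identical columns: stuck.
    stuck = {(r, c) for r in (0, 1) for c in (0, 1, 2)}
    print(is_recoverable(stuck, 14, 12))
```

The second pattern in the usage example is the six-failure configuration behind the lower bound L for (14,12,5).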

We implemented this algorithm and used it to carry
out an analysis of the recoverability likelihood of
different patterns. Figure 8b, obtained from 10M random
runs, shows the recoverability likelihood (in terms of
number of 9's) of the CORE matrix of size (14,12,5) for all
possibly recoverable failure numbers (≤ U = 20). It clearly
illustrates that CORE's lower bound of irrecoverability
(L = 6, in this setting) is too strict.

^ In other parts of this paper, we also use the term computer/node
cluster in the common sense of the word, which should not be confused
with the failure clusters in the CORE matrix.

^ Any single-row failure pattern is always recoverable.

6.3 Repair Scheduling Algorithms 

Many different repair schedules may exist for a given
fault pattern. Here, we first investigate two straw man
approaches, namely column-first and row-first, and then
propose an algorithm called Recursively Generated Schedule
(RGS). Analytical and experimental studies show
that RGS outperforms the baseline approaches.

The column-first algorithm always gives higher priority
to vertical repairs and applies horizontal repair only when
no further vertical repairs are possible. The row-first
algorithm analogously prefers horizontal repairs. In both
algorithms, when doing horizontal repairs, the best
candidate (the row with the maximum number of failures that
is still repairable) is always prioritized over the others.

Recursively Generated Schedule (RGS) algorithm:
This algorithm first identifies the critical set of
failures (failures that decrease the minimum number of
required vertical or horizontal repairs) and repairs them
first, along the call chain of a recursive cost function c.
All other repairs (non-critical ones) are scheduled using
c', a non-recursive cost function.

Figure 8: Some results: CORE failure clusters & recoverability distributions, and performance of various repair schedules.
(a) The average number of clusters versus the number of failures for CORE's code parameters (14,12,5).
(b) The recoverability likelihood of the scheme (14,12,5) in terms of number of 9's based on the number of failures (or equivalently, p).
(c) Comparing the Column-First, Row-First, and RGS algorithms w.r.t. the number of blocks required to carry out the repair.

In order to identify the critical failures, we define two
variables, v and h, as follows:

$$v = \sum_{i=1}^{t} minV(Row_i) \; ; \quad h = \sum_{j=1}^{k} minH(Col_j)$$

in which minV(Row_i) returns the minimum number of
vertical repairs required by row Row_i, and minH(Col_j)
returns the minimum number of horizontal repairs
required by column Col_j. More precisely, with |Row_i| and
|Col_j| denoting the number of failed blocks in the
respective row and column:

$$minV(Row_i) = \begin{cases} 0 & \text{if } |Row_i| \le (n-k) \\ |Row_i| - (n-k) & \text{otherwise} \end{cases}$$

$$minH(Col_j) = \begin{cases} 0 & \text{if } |Col_j| \le 1 \\ |Col_j| - 1 & \text{otherwise} \end{cases}$$
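
The two quantities v and h can be computed directly from a failure pattern, as in the following sketch. It assumes that a row needs one vertical repair for every failure beyond n - k and that a column needs one horizontal repair for every failure beyond the first; the function names and the example layouts are illustrative and are not taken from our implementation.

```python
from collections import Counter
from typing import Iterable, Tuple

Cell = Tuple[int, int]  # (row, column) of a failed block; assumed representation


def min_v(row_failures: int, n: int, k: int) -> int:
    """Vertical repairs a row needs before it becomes horizontally repairable."""
    return max(0, row_failures - (n - k))


def min_h(col_failures: int) -> int:
    """Horizontal repairs a column needs before it becomes vertically repairable."""
    return max(0, col_failures - 1)


def critical_counts(failures: Iterable[Cell], n: int, k: int) -> Tuple[int, int]:
    """Return (h, v) for a failure pattern; failure-free rows/columns add 0."""
    cells = set(failures)
    rows = Counter(r for r, _ in cells)
    cols = Counter(c for _, c in cells)
    v = sum(min_v(count, n, k) for count in rows.values())
    h = sum(min_h(count) for count in cols.values())
    return h, v


if __name__ == "__main__":
    # A hypothetical plus-shaped layout of five failures in a (14,12,5) matrix.
    plus = {(0, 1), (1, 0), (1, 1), (1, 2), (2, 1)}
    print(critical_counts(plus, 14, 12))
    # A hypothetical step-shaped layout of three failures.
    step = {(0, 0), (1, 0), (1, 1)}
    print(critical_counts(step, 14, 12))
```

For these assumed layouts the sketch yields (h, v) = (2, 1) for the plus shape and (1, 0) for the step shape.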



The most important element of RGS is the recursive cost
function c(h,v), defined as:

$$c(h,v) = \begin{cases} c(h, dec(v)) + m & \text{if } v > 0 \\ c(dec(h), v) + k & \text{if } v = 0 \text{ or } dec(v) \text{ is not applicable} \end{cases}$$

in which dec(v) and dec(h) reflect the decreases in the
values of v and h after a single repair is performed. A
vertical repair thus adds m (the number of rows of the CORE
matrix) to the cost, and a horizontal repair adds k.

The cost function c decreases the values of first v and
then h by at least one unit at each recursion step until
we reach c(0,0), which is the base case.^ The notable
property of the base case is that any remaining repair can
be done either vertically or horizontally. In other words,
there is at most one failure per column, and at most n - k
failures per row. Therefore, all remaining repair
decisions can be safely made using the static cost function
defined below:

$$c'(r) = \begin{cases} k & \text{if repaired horizontally} \\ r \times m & \text{if repaired vertically} \end{cases}$$

in which r denotes the number of remaining repairs for a
given row.

To demonstrate the differences between the repair
schedules generated by the above three algorithms, we
use two failure pattern examples in the CORE matrix of
size (14,12,5): a 3-failure step-shaped pattern and a
5-failure plus-shaped one. These examples are shown in
Table 1. It should be noted that since swapping any two
rows or any two columns in the CORE matrix results in
an equivalent failure matrix, each of these patterns
represents a class of failure patterns and not a singular
instance. Table 2 presents the schedules generated by each
algorithm for each failure pattern along with its calculated
cost in terms of repair traffic. The corresponding
experimental results are reported in Section 8.

^ If the failure pattern is recoverable, then c(h,v) will always reach
the base case.

Table 1: The step-shaped and the plus-shaped failure pattern
examples representing two classes of failure patterns. Failed
blocks are marked with X.

Pattern             Row-First        Column-First         RGS
Step     Schedule   [...]            C1, R2, C0           c(1,0) → c(0,0) → C1
         Cost       2k = 24          2t + k = 22          k + t = 17
Plus     Schedule   R1, R3, C0, R2   C0, C2, R1, R2, C1   c(2,1) → c(2,0) → c(1,0) → c(0,0) → C1
         Cost       3k + t = 41      3t + 2k = 39         2t + 2k = 34

Table 2: The analytical cost (number of blocks read) of repairing
the Step and Plus failure patterns using the Row-First,
Column-First, and RGS algorithms, where k = 12 and t = 5.
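
The analytical costs of Table 2 follow from counting repairs: each horizontal repair reads k blocks and each vertical repair reads t blocks. The sketch below reproduces this arithmetic under that assumption. The string encoding of a schedule ("R..." for a horizontal repair, "C..." for a vertical one) and the function name are conventions of this example only, and the RGS schedule is written as a sequence of repair types because Table 2 lists it as a call chain rather than as row and column indices.

```python
from typing import Iterable


def schedule_cost(schedule: Iterable[str], k: int, t: int) -> int:
    """Blocks read by a schedule: k per row repair, t per column repair."""
    cost = 0
    for step in schedule:
        if step.startswith("R"):    # horizontal (row) repair
            cost += k
        elif step.startswith("C"):  # vertical (column) repair
            cost += t
        else:
            raise ValueError(f"unknown repair step: {step!r}")
    return cost


if __name__ == "__main__":
    k, t = 12, 5
    # Plus pattern, schedules as listed in Table 2.
    print(schedule_cost(["R1", "R3", "C0", "R2"], k, t))        # Row-First
    print(schedule_cost(["C0", "C2", "R1", "R2", "C1"], k, t))  # Column-First
    # RGS on the Plus pattern: one vertical, two horizontal, one vertical.
    print(schedule_cost(["C", "R", "R", "C1"], k, t))
```

With k = 12 and t = 5 the three calls give 41, 39, and 34 blocks, in agreement with the Plus rows of Table 2.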

Finally, we generalized our analytical study of the
above three algorithms to include failure patterns of size
1 to 20. The results for 10,000 randomly generated
recoverable failure patterns are depicted in Figure 8c. Four
conclusions can be drawn from this figure: (i) RGS and
column-first perform better than row-first, and this is
especially noticeable when the number of failures is very
small (which is, in essence, the MDS code vs. CORE
comparison); (ii) as the number of failures, and
consequently the number of choices to make, increases, the
benefits of RGS over column-first become more
pronounced; (iii) for large failure numbers, distinct
schedule possibilities are limited, and all the algorithms
perform similarly; and finally (iv) a more general
conclusion is that if one wishes to avoid the relatively
complex scheduling algorithms, then the naive column-first
approach nevertheless delivers significant benefits with
respect to row-first (which roughly corresponds to MDS
codes), highlighting the immediate benefits of CORE's
product code.

7 Implementation 

To implement the CORE primitive, we used HDFS-RAID [18],
an open-source module inspired by DiskReduce and developed
at Facebook. It wraps around Apache Hadoop's distributed
file system (HDFS) and provides HDFS with basic erasure
coding capabilities (encoding and decoding). Below, we
first introduce HDFS-RAID, then explain two optimizations
that we made to HDFS-RAID to improve its performance, and
finally give an overview of our implementation of CORE.

7.1 HDFS-RAID 

HDFS-RAID embeds HDFS inside an erasure code-supporting
wrapper file system named Distributed Raid File System
(DRFS). DRFS supports both Reed-Solomon coding and simple
XOR parity files. These two coding alternatives are
orthogonal and are used separately based on user
preference. Furthermore, both provide two basic features:
encoding (a.k.a. RAIDing) data blocks and repairing
corrupt/missing blocks.

The most important components in HDFS-RAID are
RaidNode and BlockFixer. RaidNode is a daemon responsible
for the creation and maintenance of parity files
for all data files. Since the default block placement
policy of HDFS is not aware of the dependency relation
between the data and parity blocks of a given file,
HDFS-RAID manages the placement of parity blocks to avoid
co-location of data blocks and parity blocks. The
BlockFixer component reconstructs missing or corrupt blocks
by retrieving the necessary blocks, encoding/decoding them,
and sending the reconstructed blocks to new hosts.

7.2 HDFS-RAID Optimizations 

In our experiments with HDFS-RAID, we noticed 
two common performance inefficiencies, and optimized 
them: 

Opt1: The HDFS-RAID implementation does not
choose a subset of k blocks to do the repair; it always
retrieves all the remaining blocks of the stripe to carry
out a repair, independent of the number of failures. One
straightforward optimization, which requires no changes
to the internals of HDFS-RAID's Reed-Solomon code
implementation, is to retrieve exactly k blocks and
assume that all other n - k blocks are missing. This
optimization specifically targets bandwidth-scarce setups
in which the cost of retrieving a block over the network
outweighs the cost of decoding an extra block locally.

Opt2: The HDFS-RAID implementation implicitly
assumes that there is only a single failure per stripe. If
there are more failures, they are discovered only
when the read access attempts fail. These newly detected
failed blocks are then added to the list of failed blocks,
and the repair process starts again. Our optimized
implementation checks for multiple failures beforehand and
repairs them simultaneously, amortizing the repair costs.
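
The combined effect of the two optimizations on a repair plan can be illustrated with the sketch below. It is a simplified model and does not use the HDFS-RAID API: the function name `plan_repairs`, the integer block indexes, and the policy of reading the first k surviving blocks are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple


def plan_repairs(
    n: int, k: int, failed_by_stripe: Dict[int, Set[int]]
) -> Dict[int, Tuple[List[int], List[int]]]:
    """For each stripe, pick exactly k surviving blocks to read (Opt1)
    and repair all of its known failures in a single pass (Opt2)."""
    plans: Dict[int, Tuple[List[int], List[int]]] = {}
    for stripe, failed in failed_by_stripe.items():
        survivors = [b for b in range(n) if b not in failed]
        if len(survivors) < k:
            raise ValueError(f"stripe {stripe} has fewer than k surviving blocks")
        to_read = survivors[:k]   # the other n - k blocks are treated as missing
        to_repair = sorted(failed)
        plans[stripe] = (to_read, to_repair)
    return plans


if __name__ == "__main__":
    # A (9,6) stripe with one failure and another with two failures.
    for stripe, (reads, repairs) in plan_repairs(9, 6, {0: {2}, 1: {0, 4}}).items():
        print(stripe, "read", reads, "repair", repairs)
```

In this model a single failure in a (9,6) stripe triggers six block reads instead of the eight that would result from reading every surviving block.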

7.3 CORE Implementation 

The CORE storage primitive has been organically
integrated with HDFS-RAID by extending its two main
functionalities as described below. Since all changes
have been made within the RAID subdirectory of the
HDFS code, replacing the corresponding Java library
is sufficient to upgrade HDFS-RAID to CORE.

RAIDing: The CORE implementation allows vertical
coding across files in a given directory. The cross-object
stripe size parameter can be configured similarly to
the stripe size of HDFS-RAID. Vertical encoding is then
reused in the full matrix RAIDing (first row-by-row, then
column-by-column, for both data and parity blocks).

Repair: An additional vertical repair option is
introduced. The 2-dimensional repair feature implements all
the algorithms discussed in Section 6: (i) failure
detection and failure matrix population, (ii) failure
clustering, (iii) recoverability checking, and (iv) repair
scheduling.

The correctness of our implementation was verified
through multiple test cases in which the MD5 hash values
of the repaired files were compared against those of
the original files. The source code, binary distribution,
and documentation of our implementation are available
at http://sands.sce.ntu.edu.sg/StorageCORE
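
A hash-based correctness check of this kind can be sketched as follows, assuming that the hash in question is MD5 and that both the original and the repaired file are available as local files. The helper names are illustrative and are not part of our test suite.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(original_path: str, repaired_path: str) -> bool:
    """True if the repaired file hashes to the same value as the original."""
    return md5_of_file(original_path) == md5_of_file(repaired_path)
```

Reading in fixed-size chunks keeps memory use bounded, which matters for files made of 64MB blocks.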

8 Experiments 

We benchmarked the implementation with experiments
run on two different HDFS clusters of 20 nodes each:

• Network-Critical cluster: A university cluster which
has one powerful PC (4x3.2GHz Xeon Processors with
4GB of RAM) hosting the NameNode/RaidNode and 19
HP t5745 ThinClients acting as DataNodes. The average
bandwidth of this cluster is 12MB/s.

• Computation-Critical cluster: An Amazon EC2 cluster
of 20 homogeneous nodes of type m1.small (approximately,
1.2 GHz 2007 Xeon Processor with 1.8GB of
RAM). In this cluster one node hosts the
NameNode/RaidNode and the rest are used as DataNodes. The
maximum bandwidth between EC2 m1.small instances is
250MB/s.

The block size (q) used was 64MB. Files were added
to HDFS and encoded horizontally first, and then the
vertical parity was computed.

We ran two sets of experiments: a first set to compare
the performance of CORE with that of HDFS-RAID, and
a second set to study the repair scheduling algorithms. In
both sets, we use the completion time of the repair
process as the main comparison measure. However, we also
measured the amount of transferred data in each experiment
(as repair traffic). The data transfer numbers serve two
purposes: (i) to verify the correctness of our
implementation, since they must match the analytical
numbers, and (ii) to serve as a reference point in analyzing
the completion time numbers, since the amount of
transferred data is independent of the type of cluster used.

Finally, in all experiments the reported numbers are
the average of 10 runs. Since the variations were small
(up to a few percent), they are omitted from the graphs.

Figure 9: Comparing the repair performance of HDFS-RAID, HDFS-RAID-Optimized, and CORE.

Figure 10: Performances of the repair scheduling algorithms on two different failure patterns.

8.1 CORE vs. HDFS-RAID 

In these experiments, we compared three methods
(namely, HDFS-RAID, HDFS-RAID-Optimized and
CORE) using two different sets of coding parameters:
(9,6,3) and (14,12,5), inspired respectively by the code
length and storage overheads of Google's GFS and
Microsoft Azure. In each case two different failure patterns
were enforced: a one-failure pattern represented by X
and a two-failure pattern represented by XX. For the
two-failure pattern, both failures are set to happen in the
same object (i.e., on the same row). The reason for this
setting is two-fold: (i) it favors HDFS-RAID, since at
almost the same cost it can repair two failures instead of
one; (ii) if two failures happen on different rows, the
experiment is, in effect, a variation of the one-failure
pattern.

From the results shown in Figure 9, we can draw several
conclusions:

• For a single failure, the overhead of CORE is less
than 50% of that of HDFS-RAID. This is due to two
inherent advantages of CORE: (i) a single failure can be
repaired vertically, using far fewer blocks, and (ii) it uses
a much cheaper XOR operation instead of expensive
decoding/re-encoding (this is particularly significant in
the computation-critical cluster).

• The impact of our first HDFS-RAID optimization
(Opt1 in Section 7.2) can be seen in the results (the
difference between the 2nd and the 3rd chart bars). As
explained before, this optimization specifically targets
clusters in which the network is a scarce resource
(part b in Figure 9). The improvements are particularly
pronounced in cases where the number of avoided
block retrievals is higher (e.g., one failure in the scheme
(9,6,3)).

• The gains from our second HDFS-RAID optimization
(Opt2 in Section 7.2) are also noticeable (the 5th and
the 6th chart bars in all setups).

• Growth in the CORE matrix size, from (9,6,3) to
(14,12,5), results in even higher gains, especially in
clusters where computation power is scarce.
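
The vertical repair of a single failure mentioned in the first point reduces to XOR-ing the surviving blocks of a column with the column's parity block. The sketch below illustrates this on toy two-byte blocks; it assumes a simple XOR parity per column, and the helper name `xor_blocks` is chosen for this example.

```python
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    out = bytearray(len(blocks[0]))
    for block in blocks:
        if len(block) != len(out):
            raise ValueError("all blocks must have the same size")
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


if __name__ == "__main__":
    # One column of the matrix: three data blocks and their XOR parity.
    column = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
    parity = xor_blocks(column)
    # Block 1 is lost; rebuild it from the survivors and the parity.
    rebuilt = xor_blocks([column[0], column[2], parity])
    print(rebuilt == column[1])
```

No decoding matrix is involved, which is consistent with the lower computational cost observed in the computation-critical cluster.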

8.2 Repair Scheduling Algorithms 

In this set of experiments, the three repair scheduling
algorithms of Section 6.3 were compared using the Step
and Plus failure patterns. HDFS-RAID has no notion of
repair scheduling (it treats objects independently), nor
can it fully recover from the Plus failure pattern, so
it was not considered in the following experiments.

These experiments were run for a CORE matrix of size
(14,12,5). The results are shown in Figure 10 and, as
expected, the data part of this figure (part a) mirrors the
analytical results presented in Table 2. Moreover, the
completion time numbers (parts b and c) are also, to a large
extent, in line with the data results. The only two
discrepancies are explained below:

• The completion time of the Column-First algorithm
on the Plus pattern in the network-critical cluster (part b)
is longer than expected. This is caused by the last repair,
which uses two other freshly repaired blocks. Accessing
those blocks is delayed until the NameNode's
heartbeat-driven mapping tables are updated.

• The completion time of the RGS algorithm in the
computation-critical cluster (part c) is only slightly better
than that of Column-First, despite applying one vertical
repair less (see Table 2 for the schedules). This is because,
for these patterns, RGS and Column-First apply the same
number of horizontal repairs, and these are the main
driving factor of the cost in the computation-critical
cluster.

9 Conclusions & Future Work 

In this paper we demonstrated that some simple and
standard techniques (which are thus easy to implement and
to integrate organically) can provide a significant boost
to data repair and access in erasure coded distributed
storage systems. We studied our approach of introducing
cross-object coding on top of normal erasure coding
analytically, comparing it with both traditional MDS codes
and the recently proposed Local Reconstruction Codes (used
in Azure). The ideas were implemented (as the CORE storage
primitive), integrated organically with HDFS-RAID, and
benchmarked over a proprietary cluster and EC2. Analytical
and numerical studies, as well as experiments with the real
implementation, all demonstrate the superior performance of
CORE over state-of-the-art techniques for data reads and
repairs. While naive solutions can be readily used, in the
future we would like to explore the CORE code properties to
achieve better performance during data insertion/updates
as well. The current evaluations are static, based on
snapshots of the system state. We speculate that CORE's
better repair properties will yield a system in a better
state over time. We will thus carry out trace-driven
experiments to better study the system's dynamics.